Case study 2: Linear regression model to predict sales price of properties

Business Objective

This dataset contains properties sold in New York City over a 12-month period from September 2016 to September 2017. The objective is to build a model to predict sale value in the future.

Load Dataset

#Download dataset
#!wget -q https://www.dropbox.com/s/6tc7e6rc395c7jz/nyc-property-sales.zip
#Unzip the data
#!unzip nyc-property-sales.zip > /dev/null; echo " done."
#!ls
#Install Packages

#!pip -q install plotly-express
#!pip -q install shap
#!pip -q install eli5
#!pip -q install lime

Import Packages

#Import basic packages

import warnings
warnings.filterwarnings("ignore")
import time
import pandas as pd               
import numpy as np
import pickle

from sklearn.model_selection import train_test_split   #splitting data
from pylab import rcParams
from sklearn.linear_model import LinearRegression         #linear regression
from sklearn.metrics.regression import mean_squared_error #error metrics
from sklearn.metrics import mean_absolute_error

import seaborn as sns                       #visualisation
import matplotlib.pyplot as plt             #visualisation

%matplotlib inline     
sns.set(color_codes=True)
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_21860\4029770141.py in <module>
      4 warnings.filterwarnings("ignore")
      5 import time
----> 6 import pandas as pd
      7 import numpy as np
      8 import pickle

ModuleNotFoundError: No module named 'pandas'
# Code for displaying plotly express plots inline in colab
def configure_plotly_browser_state():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))
  
import plotly_express as px

Read Data

1. Through Pandas

# Read data through Pandas and compute time taken to read

t_start = time.time()
df_prop = pd.read_csv('nyc-rolling-sales.csv')
t_end = time.time()
print('pd.read_csv(): {} s'.format(t_end-t_start)) # time [s]
df_prop.head()
pd.read_csv(): 1.7584502696990967 s
Unnamed: 0 BOROUGH NEIGHBORHOOD BUILDING CLASS CATEGORY TAX CLASS AT PRESENT BLOCK LOT EASE-MENT BUILDING CLASS AT PRESENT ADDRESS ... RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET YEAR BUILT TAX CLASS AT TIME OF SALE BUILDING CLASS AT TIME OF SALE SALE PRICE SALE DATE
0 4 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2A 392 6 C2 153 AVENUE B ... 5 0 5 1633 6440 1900 2 C2 6625000 2017-07-19 00:00:00
1 5 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2 399 26 C7 234 EAST 4TH STREET ... 28 3 31 4616 18690 1900 2 C7 - 2016-12-14 00:00:00
2 6 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2 399 39 C7 197 EAST 3RD STREET ... 16 1 17 2212 7803 1900 2 C7 - 2016-12-09 00:00:00
3 7 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2B 402 21 C4 154 EAST 7TH STREET ... 10 0 10 2272 6794 1913 2 C4 3936272 2016-09-23 00:00:00
4 8 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2A 404 55 C2 301 EAST 10TH STREET ... 6 0 6 2369 4615 1900 2 C2 8000000 2016-11-17 00:00:00

5 rows × 22 columns

2. Through Dask

# Read data through Dask and compute time taken to read

import dask.dataframe as dd

t_start = time.time()
df_dask = dd.read_csv('nyc-rolling-sales.csv')
t_end = time.time()
print('dd.read_csv(): {} s'.format(t_end-t_start)) # time [s]
df_dask.tail()
dd.read_csv(): 0.12152266502380371 s
Unnamed: 0 BOROUGH NEIGHBORHOOD BUILDING CLASS CATEGORY TAX CLASS AT PRESENT BLOCK LOT EASE-MENT BUILDING CLASS AT PRESENT ADDRESS ... RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET YEAR BUILT TAX CLASS AT TIME OF SALE BUILDING CLASS AT TIME OF SALE SALE PRICE SALE DATE
84543 8409 5 WOODROW 02 TWO FAMILY DWELLINGS 1 7349 34 B9 37 QUAIL LANE ... 2 0 2 2400 2575 1998 1 B9 450000 2016-11-28 00:00:00
84544 8410 5 WOODROW 02 TWO FAMILY DWELLINGS 1 7349 78 B9 32 PHEASANT LANE ... 2 0 2 2498 2377 1998 1 B9 550000 2017-04-21 00:00:00
84545 8411 5 WOODROW 02 TWO FAMILY DWELLINGS 1 7351 60 B2 49 PITNEY AVENUE ... 2 0 2 4000 1496 1925 1 B2 460000 2017-07-05 00:00:00
84546 8412 5 WOODROW 22 STORE BUILDINGS 4 7100 28 K6 2730 ARTHUR KILL ROAD ... 0 7 7 208033 64117 2001 4 K6 11693337 2016-12-21 00:00:00
84547 8413 5 WOODROW 35 INDOOR PUBLIC AND CULTURAL FACILITIES 4 7105 679 P9 155 CLAY PIT ROAD ... 0 1 1 10796 2400 2006 4 P9 69300 2016-10-27 00:00:00

5 rows × 22 columns

Observation: Dask is almost 10 times faster than Pandas when reading files.

3. Subsample into n rows

df_prop.shape
(84548, 22)
# Sample n rows
n = df_prop.shape[0]
df_prop = df_prop.sample(n)
df_prop.shape
(84548, 22)

Exploratory Data Analysis

Let’s look into summary statistics

#Let's look into the total number of columns and observations in the dataset
df_prop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 84548 entries, 26286 to 50843
Data columns (total 22 columns):
Unnamed: 0                        84548 non-null int64
BOROUGH                           84548 non-null int64
NEIGHBORHOOD                      84548 non-null object
BUILDING CLASS CATEGORY           84548 non-null object
TAX CLASS AT PRESENT              84548 non-null object
BLOCK                             84548 non-null int64
LOT                               84548 non-null int64
EASE-MENT                         84548 non-null object
BUILDING CLASS AT PRESENT         84548 non-null object
ADDRESS                           84548 non-null object
APARTMENT NUMBER                  84548 non-null object
ZIP CODE                          84548 non-null int64
RESIDENTIAL UNITS                 84548 non-null int64
COMMERCIAL UNITS                  84548 non-null int64
TOTAL UNITS                       84548 non-null int64
LAND SQUARE FEET                  84548 non-null object
GROSS SQUARE FEET                 84548 non-null object
YEAR BUILT                        84548 non-null int64
TAX CLASS AT TIME OF SALE         84548 non-null int64
BUILDING CLASS AT TIME OF SALE    84548 non-null object
SALE PRICE                        84548 non-null object
SALE DATE                         84548 non-null object
dtypes: int64(10), object(12)
memory usage: 14.8+ MB
df_prop.isnull().sum()
Unnamed: 0                        0
BOROUGH                           0
NEIGHBORHOOD                      0
BUILDING CLASS CATEGORY           0
TAX CLASS AT PRESENT              0
BLOCK                             0
LOT                               0
EASE-MENT                         0
BUILDING CLASS AT PRESENT         0
ADDRESS                           0
APARTMENT NUMBER                  0
ZIP CODE                          0
RESIDENTIAL UNITS                 0
COMMERCIAL UNITS                  0
TOTAL UNITS                       0
LAND SQUARE FEET                  0
GROSS SQUARE FEET                 0
YEAR BUILT                        0
TAX CLASS AT TIME OF SALE         0
BUILDING CLASS AT TIME OF SALE    0
SALE PRICE                        0
SALE DATE                         0
dtype: int64
#Let's look into summary statistics of data
df_prop.describe()
Unnamed: 0 BOROUGH BLOCK LOT ZIP CODE RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS YEAR BUILT TAX CLASS AT TIME OF SALE
count 84548.000000 84548.000000 84548.000000 84548.000000 84548.000000 84548.000000 84548.000000 84548.000000 84548.000000 84548.000000
mean 10344.359878 2.998758 4237.218976 376.224015 10731.991614 2.025264 0.193559 2.249184 1789.322976 1.657485
std 7151.779436 1.289790 3568.263407 658.136814 1290.879147 16.721037 8.713183 18.972584 537.344993 0.819341
min 4.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 4231.000000 2.000000 1322.750000 22.000000 10305.000000 0.000000 0.000000 1.000000 1920.000000 1.000000
50% 8942.000000 3.000000 3311.000000 50.000000 11209.000000 1.000000 0.000000 1.000000 1940.000000 2.000000
75% 15987.250000 4.000000 6281.000000 1001.000000 11357.000000 2.000000 0.000000 2.000000 1965.000000 2.000000
max 26739.000000 5.000000 16322.000000 9106.000000 11694.000000 1844.000000 2261.000000 2261.000000 2017.000000 4.000000

Observation:

  • There is a column called Unnamed: 0 which is not required as it contains only continuous index numbers

  • The datatypes of saleprice is not correct because the summary statistics of sale price is not displayed

Hence there is a lot of data cleaning to perform.

Data Cleaning

Pandas profiling

'''#Perform Pandas profiling to understand quick overview of columns

import pandas_profiling
report = pandas_profiling.ProfileReport(df_prop)
#covert profile report as html file
report.to_file("property_data.html")'''

Removal of unnecessary columns

Observation:

  • From Pandas profiling we understand EASEMENT column has no significant value and thus has to be removed.

# Let's explore why EASE-MENT has to be rejected
df_prop['EASE-MENT'].unique()
array([' '], dtype=object)
#This column has no significance other than being an iterator
del df_prop['Unnamed: 0']
#This column has no significant value
del df_prop['EASE-MENT']

Observation:

  • From Pandas profiling we understand SALE PRICEcolumns have string value in some rows and thus has to be removed.

  • From Pandas Profiling we understand LAND SQUARE FEET and GROSS SQUARE FEET columns have string values which have to replaced by appropriate values

df_prop['SALE PRICE'] = df_prop['SALE PRICE'].replace(' -  ',np.nan)
df_prop.dropna(inplace=True)
df_prop.shape
(69987, 20)
df_prop['LAND SQUARE FEET'] = df_prop['LAND SQUARE FEET'].replace(' -  ',np.nan)
df_prop['GROSS SQUARE FEET'] = df_prop['GROSS SQUARE FEET'].replace(' -  ',np.nan)
# count the number of NaN values in each column
print(df_prop.isnull().sum())
BOROUGH                               0
NEIGHBORHOOD                          0
BUILDING CLASS CATEGORY               0
TAX CLASS AT PRESENT                  0
BLOCK                                 0
LOT                                   0
BUILDING CLASS AT PRESENT             0
ADDRESS                               0
APARTMENT NUMBER                      0
ZIP CODE                              0
RESIDENTIAL UNITS                     0
COMMERCIAL UNITS                      0
TOTAL UNITS                           0
LAND SQUARE FEET                  21188
GROSS SQUARE FEET                 21739
YEAR BUILT                            0
TAX CLASS AT TIME OF SALE             0
BUILDING CLASS AT TIME OF SALE        0
SALE PRICE                            0
SALE DATE                             0
dtype: int64
df_prop['LAND SQUARE FEET'] = df_prop['LAND SQUARE FEET'].replace('0',np.nan)
df_prop['GROSS SQUARE FEET'] = df_prop['GROSS SQUARE FEET'].replace('0',np.nan)
print(df_prop.isnull().sum())
BOROUGH                               0
NEIGHBORHOOD                          0
BUILDING CLASS CATEGORY               0
TAX CLASS AT PRESENT                  0
BLOCK                                 0
LOT                                   0
BUILDING CLASS AT PRESENT             0
ADDRESS                               0
APARTMENT NUMBER                      0
ZIP CODE                              0
RESIDENTIAL UNITS                     0
COMMERCIAL UNITS                      0
TOTAL UNITS                           0
LAND SQUARE FEET                  31514
GROSS SQUARE FEET                 33156
YEAR BUILT                            0
TAX CLASS AT TIME OF SALE             0
BUILDING CLASS AT TIME OF SALE        0
SALE PRICE                            0
SALE DATE                             0
dtype: int64
df_prop.describe()
BOROUGH BLOCK LOT ZIP CODE RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS YEAR BUILT TAX CLASS AT TIME OF SALE
count 69987.000000 69987.000000 69987.000000 69987.000000 69987.000000 69987.000000 69987.000000 69987.000000 69987.000000
mean 2.921928 4196.072528 373.828397 10741.455185 1.899553 0.172489 2.092203 1799.348236 1.641976
std 1.235688 3429.196524 656.096528 1263.234938 14.549545 9.123717 17.276100 520.884552 0.771162
min 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 2.000000 1348.000000 22.000000 10306.000000 0.000000 0.000000 0.000000 1920.000000 1.000000
50% 3.000000 3378.000000 50.000000 11209.000000 1.000000 0.000000 1.000000 1937.000000 2.000000
75% 4.000000 6186.000000 709.000000 11249.000000 2.000000 0.000000 2.000000 1965.000000 2.000000
max 5.000000 16319.000000 9106.000000 11694.000000 1844.000000 2261.000000 2261.000000 2017.000000 4.000000
## Define a function impute_median and fill land square feet and gross square feet with median values
def impute_median(series):
    return series.fillna(series.median())

df_prop['LAND SQUARE FEET'] = df_prop['LAND SQUARE FEET'].transform(impute_median)
df_prop['GROSS SQUARE FEET'] = df_prop['GROSS SQUARE FEET'].transform(impute_median)
df_prop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 69987 entries, 26286 to 73846
Data columns (total 20 columns):
BOROUGH                           69987 non-null int64
NEIGHBORHOOD                      69987 non-null object
BUILDING CLASS CATEGORY           69987 non-null object
TAX CLASS AT PRESENT              69987 non-null object
BLOCK                             69987 non-null int64
LOT                               69987 non-null int64
BUILDING CLASS AT PRESENT         69987 non-null object
ADDRESS                           69987 non-null object
APARTMENT NUMBER                  69987 non-null object
ZIP CODE                          69987 non-null int64
RESIDENTIAL UNITS                 69987 non-null int64
COMMERCIAL UNITS                  69987 non-null int64
TOTAL UNITS                       69987 non-null int64
LAND SQUARE FEET                  69987 non-null object
GROSS SQUARE FEET                 69987 non-null object
YEAR BUILT                        69987 non-null int64
TAX CLASS AT TIME OF SALE         69987 non-null int64
BUILDING CLASS AT TIME OF SALE    69987 non-null object
SALE PRICE                        69987 non-null object
SALE DATE                         69987 non-null object
dtypes: int64(9), object(11)
memory usage: 11.2+ MB
df_prop.describe()
BOROUGH BLOCK LOT ZIP CODE RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS YEAR BUILT TAX CLASS AT TIME OF SALE
count 69987.000000 69987.000000 69987.000000 69987.000000 69987.000000 69987.000000 69987.000000 69987.000000 69987.000000
mean 2.921928 4196.072528 373.828397 10741.455185 1.899553 0.172489 2.092203 1799.348236 1.641976
std 1.235688 3429.196524 656.096528 1263.234938 14.549545 9.123717 17.276100 520.884552 0.771162
min 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 2.000000 1348.000000 22.000000 10306.000000 0.000000 0.000000 0.000000 1920.000000 1.000000
50% 3.000000 3378.000000 50.000000 11209.000000 1.000000 0.000000 1.000000 1937.000000 2.000000
75% 4.000000 6186.000000 709.000000 11249.000000 2.000000 0.000000 2.000000 1965.000000 2.000000
max 5.000000 16319.000000 9106.000000 11694.000000 1844.000000 2261.000000 2261.000000 2017.000000 4.000000
#Convert few column datatypes into appropriate ones for conserving memory

df_prop['TAX CLASS AT TIME OF SALE'] = df_prop['TAX CLASS AT TIME OF SALE'].astype('category')
df_prop['TAX CLASS AT PRESENT'] = df_prop['TAX CLASS AT PRESENT'].astype('category')
df_prop['LAND SQUARE FEET'] = pd.to_numeric(df_prop['LAND SQUARE FEET'], errors='coerce')
df_prop['GROSS SQUARE FEET']= pd.to_numeric(df_prop['GROSS SQUARE FEET'], errors='coerce')
df_prop['SALE PRICE'] = pd.to_numeric(df_prop['SALE PRICE'], errors='coerce')
df_prop['BOROUGH'] = df_prop['BOROUGH'].astype('category')
#The datatypes have now been changed
df_prop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 69987 entries, 26286 to 73846
Data columns (total 20 columns):
BOROUGH                           69987 non-null category
NEIGHBORHOOD                      69987 non-null object
BUILDING CLASS CATEGORY           69987 non-null object
TAX CLASS AT PRESENT              69987 non-null category
BLOCK                             69987 non-null int64
LOT                               69987 non-null int64
BUILDING CLASS AT PRESENT         69987 non-null object
ADDRESS                           69987 non-null object
APARTMENT NUMBER                  69987 non-null object
ZIP CODE                          69987 non-null int64
RESIDENTIAL UNITS                 69987 non-null int64
COMMERCIAL UNITS                  69987 non-null int64
TOTAL UNITS                       69987 non-null int64
LAND SQUARE FEET                  69987 non-null float64
GROSS SQUARE FEET                 69987 non-null float64
YEAR BUILT                        69987 non-null int64
TAX CLASS AT TIME OF SALE         69987 non-null category
BUILDING CLASS AT TIME OF SALE    69987 non-null object
SALE PRICE                        69987 non-null int64
SALE DATE                         69987 non-null object
dtypes: category(3), float64(2), int64(8), object(7)
memory usage: 9.8+ MB
df_prop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 69987 entries, 26286 to 73846
Data columns (total 20 columns):
BOROUGH                           69987 non-null category
NEIGHBORHOOD                      69987 non-null object
BUILDING CLASS CATEGORY           69987 non-null object
TAX CLASS AT PRESENT              69987 non-null category
BLOCK                             69987 non-null int64
LOT                               69987 non-null int64
BUILDING CLASS AT PRESENT         69987 non-null object
ADDRESS                           69987 non-null object
APARTMENT NUMBER                  69987 non-null object
ZIP CODE                          69987 non-null int64
RESIDENTIAL UNITS                 69987 non-null int64
COMMERCIAL UNITS                  69987 non-null int64
TOTAL UNITS                       69987 non-null int64
LAND SQUARE FEET                  69987 non-null float64
GROSS SQUARE FEET                 69987 non-null float64
YEAR BUILT                        69987 non-null int64
TAX CLASS AT TIME OF SALE         69987 non-null category
BUILDING CLASS AT TIME OF SALE    69987 non-null object
SALE PRICE                        69987 non-null int64
SALE DATE                         69987 non-null object
dtypes: category(3), float64(2), int64(8), object(7)
memory usage: 9.8+ MB
# Let's remove sale price with a nonsensically small dollar amount: $0 most commonly. 
# Since these sales are actually transfers of deeds between parties: for example, parents transferring ownership to their home to a child after moving out for retirement.

df_prop = df_prop[df_prop['SALE PRICE']!=0]
#Let's also remove observations that have gross square feet less than 400 sq. ft
#Let's also remove observations that have gross square feet less than 400 sq. ft
#Let's also remove observations that have sale price than 1000 dollars

df_prop = df_prop[df_prop['GROSS SQUARE FEET']>400]
df_prop = df_prop[df_prop['LAND SQUARE FEET']>400]
df_prop = df_prop[df_prop['SALE PRICE']>1000]
df_prop.describe()
BLOCK LOT ZIP CODE RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET YEAR BUILT SALE PRICE
count 58544.000000 58544.000000 58544.000000 58544.000000 58544.000000 58544.000000 5.854400e+04 5.854400e+04 58544.000000 5.854400e+04
mean 4137.998565 391.834227 10707.269695 1.706716 0.163877 1.882259 3.622433e+03 3.259362e+03 1812.223917 1.524977e+06
std 3566.123396 670.343056 1251.444132 14.179364 9.922390 17.412926 3.541128e+04 2.380047e+04 502.376457 1.245480e+07
min 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 4.050000e+02 4.060000e+02 0.000000 1.110000e+03
25% 1271.750000 21.000000 10280.000000 0.000000 0.000000 0.000000 2.500000e+03 1.914750e+03 1920.000000 3.850000e+05
50% 3145.500000 50.000000 11207.000000 1.000000 0.000000 1.000000 2.500000e+03 2.000000e+03 1940.000000 6.400000e+05
75% 6099.000000 1002.000000 11356.000000 1.000000 0.000000 2.000000 2.500000e+03 2.000000e+03 1966.000000 1.098618e+06
max 16319.000000 9106.000000 11694.000000 1844.000000 2261.000000 2261.000000 4.252327e+06 3.750565e+06 2017.000000 2.210000e+09
df_prop[df_prop['SALE PRICE']==2210000000]
BOROUGH NEIGHBORHOOD BUILDING CLASS CATEGORY TAX CLASS AT PRESENT BLOCK LOT BUILDING CLASS AT PRESENT ADDRESS APARTMENT NUMBER ZIP CODE RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET YEAR BUILT TAX CLASS AT TIME OF SALE BUILDING CLASS AT TIME OF SALE SALE PRICE SALE DATE
7447 1 MIDTOWN CBD 21 OFFICE BUILDINGS 4 1301 1 O4 245 PARK AVENUE 10167 0 35 35 81336.0 1586886.0 1966 4 O4 2210000000 2017-05-05 00:00:00

Observation: The most expensive property in NYC is a whopping 2 billion dollars which can be considered as an outlier.

Let’s remove outiers!

q = df_prop["SALE PRICE"].quantile(0.99)
q
14000000.0
df_prop = df_prop[df_prop["SALE PRICE"] < q]
df_prop_lin = df_prop.copy()
# Convert sale date into time,month,year and day
df_prop['SALE DATE']=pd.to_datetime(df_prop['SALE DATE'])
df_prop['year']=df_prop['SALE DATE'].dt.year
df_prop['month']=df_prop['SALE DATE'].dt.month
df_prop['day']=df_prop['SALE DATE'].dt.day
df_prop['time']=df_prop['SALE DATE'].dt.hour
df_prop['day_week']=df_prop['SALE DATE'].dt.weekday_name
df_prop.head()
BOROUGH NEIGHBORHOOD BUILDING CLASS CATEGORY TAX CLASS AT PRESENT BLOCK LOT BUILDING CLASS AT PRESENT ADDRESS APARTMENT NUMBER ZIP CODE ... YEAR BUILT TAX CLASS AT TIME OF SALE BUILDING CLASS AT TIME OF SALE SALE PRICE SALE DATE year month day time day_week
26286 3 BAY RIDGE 10 COOPS - ELEVATOR APARTMENTS 2 6089 44 D4 9040 FORT HAMILTON PARKWA, 76 11209 ... 1955 2 D4 219000 2017-03-09 2017 3 9 0 Thursday
55023 4 ELMHURST 07 RENTALS - WALKUP APARTMENTS 2 1487 61 C1 40-03 76TH STREET 11373 ... 1927 2 C1 3250000 2016-11-02 2016 11 2 0 Wednesday
56763 4 FLUSHING-NORTH 01 ONE FAMILY DWELLINGS 1 7357 34 A5 48-05 WEEKS LANE 11365 ... 1950 1 A5 715000 2016-09-09 2016 9 9 0 Friday
47664 3 WILLIAMSBURG-CENTRAL 13 CONDOS - ELEVATOR APARTMENTS 2 2245 1319 R4 80 LORIMER STREET 3B 11206 ... 2013 2 R4 578920 2016-09-29 2016 9 29 0 Thursday
47236 3 SUNSET PARK 07 RENTALS - WALKUP APARTMENTS 2A 685 4 C2 893 4 AVENUE 11232 ... 1925 2 C2 1250000 2017-08-09 2017 8 9 0 Wednesday

5 rows × 25 columns

df_prop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57948 entries, 26286 to 73846
Data columns (total 25 columns):
BOROUGH                           57948 non-null category
NEIGHBORHOOD                      57948 non-null object
BUILDING CLASS CATEGORY           57948 non-null object
TAX CLASS AT PRESENT              57948 non-null category
BLOCK                             57948 non-null int64
LOT                               57948 non-null int64
BUILDING CLASS AT PRESENT         57948 non-null object
ADDRESS                           57948 non-null object
APARTMENT NUMBER                  57948 non-null object
ZIP CODE                          57948 non-null int64
RESIDENTIAL UNITS                 57948 non-null int64
COMMERCIAL UNITS                  57948 non-null int64
TOTAL UNITS                       57948 non-null int64
LAND SQUARE FEET                  57948 non-null float64
GROSS SQUARE FEET                 57948 non-null float64
YEAR BUILT                        57948 non-null int64
TAX CLASS AT TIME OF SALE         57948 non-null category
BUILDING CLASS AT TIME OF SALE    57948 non-null object
SALE PRICE                        57948 non-null int64
SALE DATE                         57948 non-null datetime64[ns]
year                              57948 non-null int64
month                             57948 non-null int64
day                               57948 non-null int64
time                              57948 non-null int64
day_week                          57948 non-null object
dtypes: category(3), datetime64[ns](1), float64(2), int64(12), object(7)
memory usage: 10.3+ MB

Data Visualization

#Assign numbered bouroughs to bourough names
dic = {1: 'Manhattan', 2: 'Bronx', 3: 'Brooklyn', 4: 'Queens', 5:'Staten Island'}
df_prop["borough_name"] = df_prop["BOROUGH"].apply(lambda x: dic[x])

Count of properties in NYC in each bororugh

%matplotlib inline
df_prop.borough_name.value_counts().nlargest().plot(kind='bar', figsize=(10,5))
plt.title("Number of properties by city")
plt.ylabel('Number of properties')
plt.xlabel('City');
plt.show()
_images/NYC_Property_Sales_60_0.png

Distribution of Sale Price

df_prop['SALE PRICE'].describe()
count    5.794800e+04
mean     1.056701e+06
std      1.478971e+06
min      1.110000e+03
25%      3.820000e+05
50%      6.311065e+05
75%      1.054651e+06
max      1.399500e+07
Name: SALE PRICE, dtype: float64

Observation: The maximum sale price is 14 million

df_prop['SALE PRICE'].plot.hist(bins=20, figsize=(12, 6), edgecolor = 'white')
plt.xlabel('price', fontsize=12)
plt.title('Price Distribution', fontsize=12)
plt.show()
_images/NYC_Property_Sales_64_0.png

Observation: The distribution is highly skewed towards the right which implies there are lesser properties that have a very high prices.

sns.boxplot(df_prop["SALE PRICE"])
<matplotlib.axes._subplots.AxesSubplot at 0x13bca947c88>
_images/NYC_Property_Sales_66_1.png
df_prop["log_price"] = np.log(df_prop["SALE PRICE"] + 1)
sns.boxplot(df_prop.log_price)
<matplotlib.axes._subplots.AxesSubplot at 0x13bc8eaa588>
_images/NYC_Property_Sales_67_1.png

Correlation between selected variables

The heat map produces a correlation plot between variables of the dataframe.

plt.figure(figsize=(15,10))
c = df_prop[df_prop.columns.values[0:19]].corr()
sns.heatmap(c,cmap="BrBG",annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x13bcc65fb70>
_images/NYC_Property_Sales_69_1.png

Observation: The heat map illustrates that sale price is independent of all column values that could be considered for linear regression.

Explore how gross square feet affects sale price

configure_plotly_browser_state()
px.scatter(df_prop, x="GROSS SQUARE FEET", y="SALE PRICE", size ="TOTAL UNITS" ,color="borough_name",
           hover_data=["BUILDING CLASS CATEGORY","LOT"], log_x=True, size_max=60)

Observation:

  • Properties with more total units do not fetch larger sales price

  • Properties in Staten Island have comparitively lesser sales price in comparison with other boroughs in New york city

Explore how tax class at the time of sale affect sales price

configure_plotly_browser_state()
px.box(df_prop, x="borough_name", y="SALE PRICE", color="TAX CLASS AT TIME OF SALE",hover_data=['NEIGHBORHOOD', 'BUILDING CLASS CATEGORY'],notched=True)

Observation:

  • Manhatten has the highest priced properties that have a tax class 1 representing residential property of up to three units (such as one-,two-, and three-family homes and small stores or offices with one or two attached apartments) as compared to other boroughs.

  • Properties in Staten Island have comparitively lesser sales price in comparison with other boroughs in New york city

configure_plotly_browser_state()
px.box(df_prop, x="day_week", y="SALE PRICE", color="TAX CLASS AT TIME OF SALE", notched=True)

Observation:

  • On Saturdays there are no sales for the tax class 4 which represents properties such as such as offices, factories, warehouses, garage buildings, etc.

Model Building

Prepare the Data for model building

Delete columns not necessary for prediction

df_prop_lin.columns
Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING CLASS CATEGORY',
       'TAX CLASS AT PRESENT', 'BLOCK', 'LOT', 'BUILDING CLASS AT PRESENT',
       'ADDRESS', 'APARTMENT NUMBER', 'ZIP CODE', 'RESIDENTIAL UNITS',
       'COMMERCIAL UNITS', 'TOTAL UNITS', 'LAND SQUARE FEET',
       'GROSS SQUARE FEET', 'YEAR BUILT', 'TAX CLASS AT TIME OF SALE',
       'BUILDING CLASS AT TIME OF SALE', 'SALE PRICE', 'SALE DATE'],
      dtype='object')
#Dropping few columns
del df_prop_lin['BUILDING CLASS AT PRESENT']
del df_prop_lin['BUILDING CLASS AT TIME OF SALE']
del df_prop_lin['NEIGHBORHOOD']
del df_prop_lin['ADDRESS']
del df_prop_lin['SALE DATE']
del df_prop_lin['APARTMENT NUMBER']
del df_prop_lin['RESIDENTIAL UNITS']
del df_prop_lin['COMMERCIAL UNITS']
df_prop_lin.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57948 entries, 26286 to 73846
Data columns (total 12 columns):
BOROUGH                      57948 non-null category
BUILDING CLASS CATEGORY      57948 non-null object
TAX CLASS AT PRESENT         57948 non-null category
BLOCK                        57948 non-null int64
LOT                          57948 non-null int64
ZIP CODE                     57948 non-null int64
TOTAL UNITS                  57948 non-null int64
LAND SQUARE FEET             57948 non-null float64
GROSS SQUARE FEET            57948 non-null float64
YEAR BUILT                   57948 non-null int64
TAX CLASS AT TIME OF SALE    57948 non-null category
SALE PRICE                   57948 non-null int64
dtypes: category(3), float64(2), int64(6), object(1)
memory usage: 4.6+ MB
df_prop_lin.head()
BOROUGH BUILDING CLASS CATEGORY TAX CLASS AT PRESENT BLOCK LOT ZIP CODE TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET YEAR BUILT TAX CLASS AT TIME OF SALE SALE PRICE
26286 3 10 COOPS - ELEVATOR APARTMENTS 2 6089 44 11209 0 2500.0 2000.0 1955 2 219000
55023 4 07 RENTALS - WALKUP APARTMENTS 2 1487 61 11373 16 4000.0 9740.0 1927 2 3250000
56763 4 01 ONE FAMILY DWELLINGS 1 7357 34 11365 1 2760.0 1188.0 1950 1 715000
47664 3 13 CONDOS - ELEVATOR APARTMENTS 2 2245 1319 11206 1 2500.0 2000.0 2013 2 578920
47236 3 07 RENTALS - WALKUP APARTMENTS 2A 685 4 11232 6 2587.0 4000.0 1925 2 1250000
season = [winter, rainy, summer]
 season_rainy, season_summer
  File "<ipython-input-53-99e89a64e56d>", line 2
    season_rainy, season_summer
    ^
IndentationError: unexpected indent

Perform one-hot encoding for categorical variables

#Select the variables to be one-hot encoded
one_hot_features = ['BOROUGH', 'BUILDING CLASS CATEGORY','TAX CLASS AT PRESENT','TAX CLASS AT TIME OF SALE']
# Convert categorical variables into dummy/indicator variables (i.e. one-hot encoding).
one_hot_encoded = pd.get_dummies(df_prop_lin[one_hot_features],drop_first=True)
one_hot_encoded
#one_hot_encoded.info(verbose=True, memory_usage=True, null_counts=True)
BOROUGH_2 BOROUGH_3 BOROUGH_4 BOROUGH_5 BUILDING CLASS CATEGORY_02 TWO FAMILY DWELLINGS BUILDING CLASS CATEGORY_03 THREE FAMILY DWELLINGS BUILDING CLASS CATEGORY_04 TAX CLASS 1 CONDOS BUILDING CLASS CATEGORY_05 TAX CLASS 1 VACANT LAND BUILDING CLASS CATEGORY_06 TAX CLASS 1 - OTHER BUILDING CLASS CATEGORY_07 RENTALS - WALKUP APARTMENTS ... TAX CLASS AT PRESENT_1C TAX CLASS AT PRESENT_2 TAX CLASS AT PRESENT_2A TAX CLASS AT PRESENT_2B TAX CLASS AT PRESENT_2C TAX CLASS AT PRESENT_3 TAX CLASS AT PRESENT_4 TAX CLASS AT TIME OF SALE_2 TAX CLASS AT TIME OF SALE_3 TAX CLASS AT TIME OF SALE_4
26286 0 1 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
55023 0 0 1 0 0 0 0 0 0 1 ... 0 1 0 0 0 0 0 1 0 0
56763 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
47664 0 1 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
47236 0 1 0 0 0 0 0 0 0 1 ... 0 0 1 0 0 0 0 1 0 0
83254 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
82112 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10044 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
44548 0 1 0 0 0 0 1 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
48289 0 1 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
46216 0 1 0 0 0 0 1 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
234 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 1 0 0
24472 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
73449 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
76944 0 0 0 1 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
72359 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
40709 0 1 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
42960 0 1 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
80212 0 0 0 1 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
44195 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
30963 0 1 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
6279 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
61077 0 0 1 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
44119 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
22973 1 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
55237 0 0 1 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
7829 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
74118 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
11537 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
55407 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
75092 0 0 1 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
83271 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
80674 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
14337 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
40328 0 1 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6497 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
33037 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 1 0 0
65621 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
65798 0 0 1 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
29235 0 1 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
13274 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
44693 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 1 0 0
38986 0 1 0 0 0 0 0 0 0 1 ... 0 0 1 0 0 0 0 1 0 0
17621 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
12390 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
24038 1 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
73328 0 0 1 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
724 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
36921 0 1 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
19536 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
31975 0 1 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
80899 0 0 0 1 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
45516 0 1 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
15179 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
72025 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
26264 0 1 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
56193 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
65888 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
11837 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
73846 0 0 1 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

57948 rows × 60 columns

df_prop_lin.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57948 entries, 26286 to 73846
Data columns (total 12 columns):
BOROUGH                      57948 non-null category
BUILDING CLASS CATEGORY      57948 non-null object
TAX CLASS AT PRESENT         57948 non-null category
BLOCK                        57948 non-null int64
LOT                          57948 non-null int64
ZIP CODE                     57948 non-null int64
TOTAL UNITS                  57948 non-null int64
LAND SQUARE FEET             57948 non-null float64
GROSS SQUARE FEET            57948 non-null float64
YEAR BUILT                   57948 non-null int64
TAX CLASS AT TIME OF SALE    57948 non-null category
SALE PRICE                   57948 non-null int64
dtypes: category(3), float64(2), int64(6), object(1)
memory usage: 4.6+ MB
one_hot_encoded.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57948 entries, 26286 to 73846
Data columns (total 60 columns):
BOROUGH_2                                                               57948 non-null uint8
BOROUGH_3                                                               57948 non-null uint8
BOROUGH_4                                                               57948 non-null uint8
BOROUGH_5                                                               57948 non-null uint8
BUILDING CLASS CATEGORY_02 TWO FAMILY DWELLINGS                         57948 non-null uint8
BUILDING CLASS CATEGORY_03 THREE FAMILY DWELLINGS                       57948 non-null uint8
BUILDING CLASS CATEGORY_04 TAX CLASS 1 CONDOS                           57948 non-null uint8
BUILDING CLASS CATEGORY_05 TAX CLASS 1 VACANT LAND                      57948 non-null uint8
BUILDING CLASS CATEGORY_06 TAX CLASS 1 - OTHER                          57948 non-null uint8
BUILDING CLASS CATEGORY_07 RENTALS - WALKUP APARTMENTS                  57948 non-null uint8
BUILDING CLASS CATEGORY_08 RENTALS - ELEVATOR APARTMENTS                57948 non-null uint8
BUILDING CLASS CATEGORY_09 COOPS - WALKUP APARTMENTS                    57948 non-null uint8
BUILDING CLASS CATEGORY_10 COOPS - ELEVATOR APARTMENTS                  57948 non-null uint8
BUILDING CLASS CATEGORY_11 SPECIAL CONDO BILLING LOTS                   57948 non-null uint8
BUILDING CLASS CATEGORY_11A CONDO-RENTALS                               57948 non-null uint8
BUILDING CLASS CATEGORY_12 CONDOS - WALKUP APARTMENTS                   57948 non-null uint8
BUILDING CLASS CATEGORY_13 CONDOS - ELEVATOR APARTMENTS                 57948 non-null uint8
BUILDING CLASS CATEGORY_14 RENTALS - 4-10 UNIT                          57948 non-null uint8
BUILDING CLASS CATEGORY_15 CONDOS - 2-10 UNIT RESIDENTIAL               57948 non-null uint8
BUILDING CLASS CATEGORY_16 CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT      57948 non-null uint8
BUILDING CLASS CATEGORY_17 CONDO COOPS                                  57948 non-null uint8
BUILDING CLASS CATEGORY_21 OFFICE BUILDINGS                             57948 non-null uint8
BUILDING CLASS CATEGORY_22 STORE BUILDINGS                              57948 non-null uint8
BUILDING CLASS CATEGORY_23 LOFT BUILDINGS                               57948 non-null uint8
BUILDING CLASS CATEGORY_26 OTHER HOTELS                                 57948 non-null uint8
BUILDING CLASS CATEGORY_27 FACTORIES                                    57948 non-null uint8
BUILDING CLASS CATEGORY_28 COMMERCIAL CONDOS                            57948 non-null uint8
BUILDING CLASS CATEGORY_29 COMMERCIAL GARAGES                           57948 non-null uint8
BUILDING CLASS CATEGORY_30 WAREHOUSES                                   57948 non-null uint8
BUILDING CLASS CATEGORY_31 COMMERCIAL VACANT LAND                       57948 non-null uint8
BUILDING CLASS CATEGORY_32 HOSPITAL AND HEALTH FACILITIES               57948 non-null uint8
BUILDING CLASS CATEGORY_33 EDUCATIONAL FACILITIES                       57948 non-null uint8
BUILDING CLASS CATEGORY_34 THEATRES                                     57948 non-null uint8
BUILDING CLASS CATEGORY_35 INDOOR PUBLIC AND CULTURAL FACILITIES        57948 non-null uint8
BUILDING CLASS CATEGORY_36 OUTDOOR RECREATIONAL FACILITIES              57948 non-null uint8
BUILDING CLASS CATEGORY_37 RELIGIOUS FACILITIES                         57948 non-null uint8
BUILDING CLASS CATEGORY_38 ASYLUMS AND HOMES                            57948 non-null uint8
BUILDING CLASS CATEGORY_39 TRANSPORTATION FACILITIES                    57948 non-null uint8
BUILDING CLASS CATEGORY_41 TAX CLASS 4 - OTHER                          57948 non-null uint8
BUILDING CLASS CATEGORY_42 CONDO CULTURAL/MEDICAL/EDUCATIONAL/ETC       57948 non-null uint8
BUILDING CLASS CATEGORY_43 CONDO OFFICE BUILDINGS                       57948 non-null uint8
BUILDING CLASS CATEGORY_44 CONDO PARKING                                57948 non-null uint8
BUILDING CLASS CATEGORY_45 CONDO HOTELS                                 57948 non-null uint8
BUILDING CLASS CATEGORY_46 CONDO STORE BUILDINGS                        57948 non-null uint8
BUILDING CLASS CATEGORY_47 CONDO NON-BUSINESS STORAGE                   57948 non-null uint8
BUILDING CLASS CATEGORY_48 CONDO TERRACES/GARDENS/CABANAS               57948 non-null uint8
BUILDING CLASS CATEGORY_49 CONDO WAREHOUSES/FACTORY/INDUS               57948 non-null uint8
TAX CLASS AT PRESENT_1                                                  57948 non-null uint8
TAX CLASS AT PRESENT_1A                                                 57948 non-null uint8
TAX CLASS AT PRESENT_1B                                                 57948 non-null uint8
TAX CLASS AT PRESENT_1C                                                 57948 non-null uint8
TAX CLASS AT PRESENT_2                                                  57948 non-null uint8
TAX CLASS AT PRESENT_2A                                                 57948 non-null uint8
TAX CLASS AT PRESENT_2B                                                 57948 non-null uint8
TAX CLASS AT PRESENT_2C                                                 57948 non-null uint8
TAX CLASS AT PRESENT_3                                                  57948 non-null uint8
TAX CLASS AT PRESENT_4                                                  57948 non-null uint8
TAX CLASS AT TIME OF SALE_2                                             57948 non-null uint8
TAX CLASS AT TIME OF SALE_3                                             57948 non-null uint8
TAX CLASS AT TIME OF SALE_4                                             57948 non-null uint8
dtypes: uint8(60)
memory usage: 3.8 MB
# Replacing categorical columns with dummies
fdf = df_prop_lin.drop(one_hot_features,axis=1)
fdf = pd.concat([fdf, one_hot_encoded] ,axis=1)
fdf.info()
#print (fdf.shape)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57948 entries, 26286 to 73846
Data columns (total 68 columns):
BLOCK                                                                   57948 non-null int64
LOT                                                                     57948 non-null int64
ZIP CODE                                                                57948 non-null int64
TOTAL UNITS                                                             57948 non-null int64
LAND SQUARE FEET                                                        57948 non-null float64
GROSS SQUARE FEET                                                       57948 non-null float64
YEAR BUILT                                                              57948 non-null int64
SALE PRICE                                                              57948 non-null int64
BOROUGH_2                                                               57948 non-null uint8
BOROUGH_3                                                               57948 non-null uint8
BOROUGH_4                                                               57948 non-null uint8
BOROUGH_5                                                               57948 non-null uint8
BUILDING CLASS CATEGORY_02 TWO FAMILY DWELLINGS                         57948 non-null uint8
BUILDING CLASS CATEGORY_03 THREE FAMILY DWELLINGS                       57948 non-null uint8
BUILDING CLASS CATEGORY_04 TAX CLASS 1 CONDOS                           57948 non-null uint8
BUILDING CLASS CATEGORY_05 TAX CLASS 1 VACANT LAND                      57948 non-null uint8
BUILDING CLASS CATEGORY_06 TAX CLASS 1 - OTHER                          57948 non-null uint8
BUILDING CLASS CATEGORY_07 RENTALS - WALKUP APARTMENTS                  57948 non-null uint8
BUILDING CLASS CATEGORY_08 RENTALS - ELEVATOR APARTMENTS                57948 non-null uint8
BUILDING CLASS CATEGORY_09 COOPS - WALKUP APARTMENTS                    57948 non-null uint8
BUILDING CLASS CATEGORY_10 COOPS - ELEVATOR APARTMENTS                  57948 non-null uint8
BUILDING CLASS CATEGORY_11 SPECIAL CONDO BILLING LOTS                   57948 non-null uint8
BUILDING CLASS CATEGORY_11A CONDO-RENTALS                               57948 non-null uint8
BUILDING CLASS CATEGORY_12 CONDOS - WALKUP APARTMENTS                   57948 non-null uint8
BUILDING CLASS CATEGORY_13 CONDOS - ELEVATOR APARTMENTS                 57948 non-null uint8
BUILDING CLASS CATEGORY_14 RENTALS - 4-10 UNIT                          57948 non-null uint8
BUILDING CLASS CATEGORY_15 CONDOS - 2-10 UNIT RESIDENTIAL               57948 non-null uint8
BUILDING CLASS CATEGORY_16 CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT      57948 non-null uint8
BUILDING CLASS CATEGORY_17 CONDO COOPS                                  57948 non-null uint8
BUILDING CLASS CATEGORY_21 OFFICE BUILDINGS                             57948 non-null uint8
BUILDING CLASS CATEGORY_22 STORE BUILDINGS                              57948 non-null uint8
BUILDING CLASS CATEGORY_23 LOFT BUILDINGS                               57948 non-null uint8
BUILDING CLASS CATEGORY_26 OTHER HOTELS                                 57948 non-null uint8
BUILDING CLASS CATEGORY_27 FACTORIES                                    57948 non-null uint8
BUILDING CLASS CATEGORY_28 COMMERCIAL CONDOS                            57948 non-null uint8
BUILDING CLASS CATEGORY_29 COMMERCIAL GARAGES                           57948 non-null uint8
BUILDING CLASS CATEGORY_30 WAREHOUSES                                   57948 non-null uint8
BUILDING CLASS CATEGORY_31 COMMERCIAL VACANT LAND                       57948 non-null uint8
BUILDING CLASS CATEGORY_32 HOSPITAL AND HEALTH FACILITIES               57948 non-null uint8
BUILDING CLASS CATEGORY_33 EDUCATIONAL FACILITIES                       57948 non-null uint8
BUILDING CLASS CATEGORY_34 THEATRES                                     57948 non-null uint8
BUILDING CLASS CATEGORY_35 INDOOR PUBLIC AND CULTURAL FACILITIES        57948 non-null uint8
BUILDING CLASS CATEGORY_36 OUTDOOR RECREATIONAL FACILITIES              57948 non-null uint8
BUILDING CLASS CATEGORY_37 RELIGIOUS FACILITIES                         57948 non-null uint8
BUILDING CLASS CATEGORY_38 ASYLUMS AND HOMES                            57948 non-null uint8
BUILDING CLASS CATEGORY_39 TRANSPORTATION FACILITIES                    57948 non-null uint8
BUILDING CLASS CATEGORY_41 TAX CLASS 4 - OTHER                          57948 non-null uint8
BUILDING CLASS CATEGORY_42 CONDO CULTURAL/MEDICAL/EDUCATIONAL/ETC       57948 non-null uint8
BUILDING CLASS CATEGORY_43 CONDO OFFICE BUILDINGS                       57948 non-null uint8
BUILDING CLASS CATEGORY_44 CONDO PARKING                                57948 non-null uint8
BUILDING CLASS CATEGORY_45 CONDO HOTELS                                 57948 non-null uint8
BUILDING CLASS CATEGORY_46 CONDO STORE BUILDINGS                        57948 non-null uint8
BUILDING CLASS CATEGORY_47 CONDO NON-BUSINESS STORAGE                   57948 non-null uint8
BUILDING CLASS CATEGORY_48 CONDO TERRACES/GARDENS/CABANAS               57948 non-null uint8
BUILDING CLASS CATEGORY_49 CONDO WAREHOUSES/FACTORY/INDUS               57948 non-null uint8
TAX CLASS AT PRESENT_1                                                  57948 non-null uint8
TAX CLASS AT PRESENT_1A                                                 57948 non-null uint8
TAX CLASS AT PRESENT_1B                                                 57948 non-null uint8
TAX CLASS AT PRESENT_1C                                                 57948 non-null uint8
TAX CLASS AT PRESENT_2                                                  57948 non-null uint8
TAX CLASS AT PRESENT_2A                                                 57948 non-null uint8
TAX CLASS AT PRESENT_2B                                                 57948 non-null uint8
TAX CLASS AT PRESENT_2C                                                 57948 non-null uint8
TAX CLASS AT PRESENT_3                                                  57948 non-null uint8
TAX CLASS AT PRESENT_4                                                  57948 non-null uint8
TAX CLASS AT TIME OF SALE_2                                             57948 non-null uint8
TAX CLASS AT TIME OF SALE_3                                             57948 non-null uint8
TAX CLASS AT TIME OF SALE_4                                             57948 non-null uint8
dtypes: float64(2), int64(6), uint8(60)
memory usage: 7.3 MB
del fdf['LOT']
del fdf['BLOCK']
del fdf['ZIP CODE']
del fdf['YEAR BUILT']
fdf.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57948 entries, 26286 to 73846
Data columns (total 64 columns):
TOTAL UNITS                                                             57948 non-null int64
LAND SQUARE FEET                                                        57948 non-null float64
GROSS SQUARE FEET                                                       57948 non-null float64
SALE PRICE                                                              57948 non-null int64
BOROUGH_2                                                               57948 non-null uint8
BOROUGH_3                                                               57948 non-null uint8
BOROUGH_4                                                               57948 non-null uint8
BOROUGH_5                                                               57948 non-null uint8
BUILDING CLASS CATEGORY_02 TWO FAMILY DWELLINGS                         57948 non-null uint8
BUILDING CLASS CATEGORY_03 THREE FAMILY DWELLINGS                       57948 non-null uint8
BUILDING CLASS CATEGORY_04 TAX CLASS 1 CONDOS                           57948 non-null uint8
BUILDING CLASS CATEGORY_05 TAX CLASS 1 VACANT LAND                      57948 non-null uint8
BUILDING CLASS CATEGORY_06 TAX CLASS 1 - OTHER                          57948 non-null uint8
BUILDING CLASS CATEGORY_07 RENTALS - WALKUP APARTMENTS                  57948 non-null uint8
BUILDING CLASS CATEGORY_08 RENTALS - ELEVATOR APARTMENTS                57948 non-null uint8
BUILDING CLASS CATEGORY_09 COOPS - WALKUP APARTMENTS                    57948 non-null uint8
BUILDING CLASS CATEGORY_10 COOPS - ELEVATOR APARTMENTS                  57948 non-null uint8
BUILDING CLASS CATEGORY_11 SPECIAL CONDO BILLING LOTS                   57948 non-null uint8
BUILDING CLASS CATEGORY_11A CONDO-RENTALS                               57948 non-null uint8
BUILDING CLASS CATEGORY_12 CONDOS - WALKUP APARTMENTS                   57948 non-null uint8
BUILDING CLASS CATEGORY_13 CONDOS - ELEVATOR APARTMENTS                 57948 non-null uint8
BUILDING CLASS CATEGORY_14 RENTALS - 4-10 UNIT                          57948 non-null uint8
BUILDING CLASS CATEGORY_15 CONDOS - 2-10 UNIT RESIDENTIAL               57948 non-null uint8
BUILDING CLASS CATEGORY_16 CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT      57948 non-null uint8
BUILDING CLASS CATEGORY_17 CONDO COOPS                                  57948 non-null uint8
BUILDING CLASS CATEGORY_21 OFFICE BUILDINGS                             57948 non-null uint8
BUILDING CLASS CATEGORY_22 STORE BUILDINGS                              57948 non-null uint8
BUILDING CLASS CATEGORY_23 LOFT BUILDINGS                               57948 non-null uint8
BUILDING CLASS CATEGORY_26 OTHER HOTELS                                 57948 non-null uint8
BUILDING CLASS CATEGORY_27 FACTORIES                                    57948 non-null uint8
BUILDING CLASS CATEGORY_28 COMMERCIAL CONDOS                            57948 non-null uint8
BUILDING CLASS CATEGORY_29 COMMERCIAL GARAGES                           57948 non-null uint8
BUILDING CLASS CATEGORY_30 WAREHOUSES                                   57948 non-null uint8
BUILDING CLASS CATEGORY_31 COMMERCIAL VACANT LAND                       57948 non-null uint8
BUILDING CLASS CATEGORY_32 HOSPITAL AND HEALTH FACILITIES               57948 non-null uint8
BUILDING CLASS CATEGORY_33 EDUCATIONAL FACILITIES                       57948 non-null uint8
BUILDING CLASS CATEGORY_34 THEATRES                                     57948 non-null uint8
BUILDING CLASS CATEGORY_35 INDOOR PUBLIC AND CULTURAL FACILITIES        57948 non-null uint8
BUILDING CLASS CATEGORY_36 OUTDOOR RECREATIONAL FACILITIES              57948 non-null uint8
BUILDING CLASS CATEGORY_37 RELIGIOUS FACILITIES                         57948 non-null uint8
BUILDING CLASS CATEGORY_38 ASYLUMS AND HOMES                            57948 non-null uint8
BUILDING CLASS CATEGORY_39 TRANSPORTATION FACILITIES                    57948 non-null uint8
BUILDING CLASS CATEGORY_41 TAX CLASS 4 - OTHER                          57948 non-null uint8
BUILDING CLASS CATEGORY_42 CONDO CULTURAL/MEDICAL/EDUCATIONAL/ETC       57948 non-null uint8
BUILDING CLASS CATEGORY_43 CONDO OFFICE BUILDINGS                       57948 non-null uint8
BUILDING CLASS CATEGORY_44 CONDO PARKING                                57948 non-null uint8
BUILDING CLASS CATEGORY_45 CONDO HOTELS                                 57948 non-null uint8
BUILDING CLASS CATEGORY_46 CONDO STORE BUILDINGS                        57948 non-null uint8
BUILDING CLASS CATEGORY_47 CONDO NON-BUSINESS STORAGE                   57948 non-null uint8
BUILDING CLASS CATEGORY_48 CONDO TERRACES/GARDENS/CABANAS               57948 non-null uint8
BUILDING CLASS CATEGORY_49 CONDO WAREHOUSES/FACTORY/INDUS               57948 non-null uint8
TAX CLASS AT PRESENT_1                                                  57948 non-null uint8
TAX CLASS AT PRESENT_1A                                                 57948 non-null uint8
TAX CLASS AT PRESENT_1B                                                 57948 non-null uint8
TAX CLASS AT PRESENT_1C                                                 57948 non-null uint8
TAX CLASS AT PRESENT_2                                                  57948 non-null uint8
TAX CLASS AT PRESENT_2A                                                 57948 non-null uint8
TAX CLASS AT PRESENT_2B                                                 57948 non-null uint8
TAX CLASS AT PRESENT_2C                                                 57948 non-null uint8
TAX CLASS AT PRESENT_3                                                  57948 non-null uint8
TAX CLASS AT PRESENT_4                                                  57948 non-null uint8
TAX CLASS AT TIME OF SALE_2                                             57948 non-null uint8
TAX CLASS AT TIME OF SALE_3                                             57948 non-null uint8
TAX CLASS AT TIME OF SALE_4                                             57948 non-null uint8
dtypes: float64(2), int64(2), uint8(60)
memory usage: 5.5 MB

Training and Test data

X = fdf.drop(['SALE PRICE'],axis=1)
print (X)
y = fdf['SALE PRICE']

fdf_normalized = pd.DataFrame(data=X, index=X.index, columns=X.columns)

print (y)
       TOTAL UNITS  LAND SQUARE FEET  GROSS SQUARE FEET  BOROUGH_2  BOROUGH_3  \
26286            0            2500.0             2000.0          0          1   
55023           16            4000.0             9740.0          0          0   
56763            1            2760.0             1188.0          0          0   
47664            1            2500.0             2000.0          0          1   
47236            6            2587.0             4000.0          0          1   
83254            1            9800.0             1428.0          0          0   
82112            1            7750.0             1298.0          0          0   
10044            0            2500.0             2000.0          0          0   
44548            1            2500.0             2000.0          0          1   
48289            3            1910.0             2760.0          0          1   
46216            1            2500.0             2000.0          0          1   
234              7            2469.0            19892.0          0          0   
24472            1            2375.0             1425.0          1          0   
73449            1            2905.0             1152.0          0          0   
76944            2            2880.0             1344.0          0          0   
72359            1            2190.0              784.0          0          0   
40709            3            2500.0             3150.0          0          1   
42960            2            1353.0             1428.0          0          1   
80212            2            2712.0             1800.0          0          0   
44195            1            2000.0             1372.0          0          1   
30963            1            2500.0             2000.0          0          1   
6279             1            2500.0             2000.0          0          0   
61077            0            2500.0             2000.0          0          0   
44119            1            2500.0             2000.0          0          1   
22973            0            2500.0             2000.0          1          0   
55237            1            2500.0             2000.0          0          0   
7829             0            2500.0             2000.0          0          0   
74118            1            3000.0             1744.0          0          0   
11537            1            2500.0             2000.0          0          0   
55407            1            2500.0             2000.0          0          0   
...            ...               ...                ...        ...        ...   
75092            2            2500.0             1552.0          0          0   
83271            1             899.0             1488.0          0          0   
80674            1            4000.0             2124.0          0          0   
14337            0            2500.0             2000.0          0          0   
40328            1            2500.0             2000.0          0          1   
6497             1            2500.0             2000.0          0          0   
33037            0            2500.0             2000.0          0          1   
65621            1            3760.0             1878.0          0          0   
65798            2            2500.0             4000.0          0          0   
29235            2            2003.0             1920.0          0          1   
13274            0            2500.0             2000.0          0          0   
44693            0            2500.0             2000.0          0          1   
38986            4            2700.0             3600.0          0          1   
17621            0            2500.0             2000.0          0          0   
12390            0            2500.0             2000.0          0          0   
24038            0            2500.0             2000.0          1          0   
73328            2            5000.0             1800.0          0          0   
724              0            2500.0             2000.0          0          0   
36921            0            1500.0             2000.0          0          1   
19536            2            2146.0             2760.0          1          0   
31975            0            2500.0             2000.0          0          1   
80899            2            2604.0             1400.0          0          0   
45516            1            2500.0             2000.0          0          1   
15179            0            2500.0             2000.0          0          0   
72025            1            2000.0             1624.0          0          0   
26264            0            2500.0             2000.0          0          1   
56193            1            3800.0             1624.0          0          0   
65888            1            5000.0             1073.0          0          0   
11837            0            2500.0             2000.0          0          0   
73846            2            2926.0             2082.0          0          0   

       BOROUGH_4  BOROUGH_5  \
26286          0          0   
55023          1          0   
56763          1          0   
47664          0          0   
47236          0          0   
83254          0          1   
82112          0          1   
10044          0          0   
44548          0          0   
48289          0          0   
46216          0          0   
234            0          0   
24472          0          0   
73449          1          0   
76944          0          1   
72359          1          0   
40709          0          0   
42960          0          0   
80212          0          1   
44195          0          0   
30963          0          0   
6279           0          0   
61077          1          0   
44119          0          0   
22973          0          0   
55237          1          0   
7829           0          0   
74118          1          0   
11537          0          0   
55407          1          0   
...          ...        ...   
75092          1          0   
83271          0          1   
80674          0          1   
14337          0          0   
40328          0          0   
6497           0          0   
33037          0          0   
65621          1          0   
65798          1          0   
29235          0          0   
13274          0          0   
44693          0          0   
38986          0          0   
17621          0          0   
12390          0          0   
24038          0          0   
73328          1          0   
724            0          0   
36921          0          0   
19536          0          0   
31975          0          0   
80899          0          1   
45516          0          0   
15179          0          0   
72025          1          0   
26264          0          0   
56193          1          0   
65888          1          0   
11837          0          0   
73846          1          0   

       BUILDING CLASS CATEGORY_02 TWO FAMILY DWELLINGS                      \
26286                                                  0                     
55023                                                  0                     
56763                                                  0                     
47664                                                  0                     
47236                                                  0                     
83254                                                  0                     
82112                                                  0                     
10044                                                  0                     
44548                                                  0                     
48289                                                  0                     
46216                                                  0                     
234                                                    0                     
24472                                                  0                     
73449                                                  0                     
76944                                                  1                     
72359                                                  0                     
40709                                                  0                     
42960                                                  1                     
80212                                                  1                     
44195                                                  0                     
30963                                                  0                     
6279                                                   0                     
61077                                                  0                     
44119                                                  0                     
22973                                                  0                     
55237                                                  0                     
7829                                                   0                     
74118                                                  0                     
11537                                                  0                     
55407                                                  0                     
...                                                  ...                     
75092                                                  1                     
83271                                                  0                     
80674                                                  0                     
14337                                                  0                     
40328                                                  0                     
6497                                                   0                     
33037                                                  0                     
65621                                                  0                     
65798                                                  1                     
29235                                                  1                     
13274                                                  0                     
44693                                                  0                     
38986                                                  0                     
17621                                                  0                     
12390                                                  0                     
24038                                                  0                     
73328                                                  1                     
724                                                    0                     
36921                                                  0                     
19536                                                  1                     
31975                                                  0                     
80899                                                  1                     
45516                                                  0                     
15179                                                  0                     
72025                                                  0                     
26264                                                  0                     
56193                                                  0                     
65888                                                  0                     
11837                                                  0                     
73846                                                  1                     

       BUILDING CLASS CATEGORY_03 THREE FAMILY DWELLINGS                    \
26286                                                  0                     
55023                                                  0                     
56763                                                  0                     
47664                                                  0                     
47236                                                  0                     
83254                                                  0                     
82112                                                  0                     
10044                                                  0                     
44548                                                  0                     
48289                                                  1                     
46216                                                  0                     
234                                                    0                     
24472                                                  0                     
73449                                                  0                     
76944                                                  0                     
72359                                                  0                     
40709                                                  1                     
42960                                                  0                     
80212                                                  0                     
44195                                                  0                     
30963                                                  0                     
6279                                                   0                     
61077                                                  0                     
44119                                                  0                     
22973                                                  0                     
55237                                                  0                     
7829                                                   0                     
74118                                                  0                     
11537                                                  0                     
55407                                                  0                     
...                                                  ...                     
75092                                                  0                     
83271                                                  0                     
80674                                                  0                     
14337                                                  0                     
40328                                                  0                     
6497                                                   0                     
33037                                                  0                     
65621                                                  0                     
65798                                                  0                     
29235                                                  0                     
13274                                                  0                     
44693                                                  0                     
38986                                                  0                     
17621                                                  0                     
12390                                                  0                     
24038                                                  0                     
73328                                                  0                     
724                                                    0                     
36921                                                  0                     
19536                                                  0                     
31975                                                  0                     
80899                                                  0                     
45516                                                  0                     
15179                                                  0                     
72025                                                  0                     
26264                                                  0                     
56193                                                  0                     
65888                                                  0                     
11837                                                  0                     
73846                                                  0                     

       BUILDING CLASS CATEGORY_04 TAX CLASS 1 CONDOS                        \
26286                                                  0                     
55023                                                  0                     
56763                                                  0                     
47664                                                  0                     
47236                                                  0                     
83254                                                  0                     
82112                                                  0                     
10044                                                  0                     
44548                                                  1                     
48289                                                  0                     
46216                                                  1                     
234                                                    0                     
24472                                                  0                     
73449                                                  0                     
76944                                                  0                     
72359                                                  0                     
40709                                                  0                     
42960                                                  0                     
80212                                                  0                     
44195                                                  0                     
30963                                                  0                     
6279                                                   0                     
61077                                                  0                     
44119                                                  0                     
22973                                                  0                     
55237                                                  0                     
7829                                                   0                     
74118                                                  0                     
11537                                                  0                     
55407                                                  0                     
...                                                  ...                     
75092                                                  0                     
83271                                                  0                     
80674                                                  0                     
14337                                                  0                     
40328                                                  1                     
6497                                                   0                     
33037                                                  0                     
65621                                                  0                     
65798                                                  0                     
29235                                                  0                     
13274                                                  0                     
44693                                                  0                     
38986                                                  0                     
17621                                                  0                     
12390                                                  0                     
24038                                                  0                     
73328                                                  0                     
724                                                    0                     
36921                                                  0                     
19536                                                  0                     
31975                                                  0                     
80899                                                  0                     
45516                                                  0                     
15179                                                  0                     
72025                                                  0                     
26264                                                  0                     
56193                                                  0                     
65888                                                  0                     
11837                                                  0                     
73846                                                  0                     

       ...  TAX CLASS AT PRESENT_1C  TAX CLASS AT PRESENT_2  \
26286  ...                        0                       1   
55023  ...                        0                       1   
56763  ...                        0                       0   
47664  ...                        0                       1   
47236  ...                        0                       0   
83254  ...                        0                       0   
82112  ...                        0                       0   
10044  ...                        0                       1   
44548  ...                        1                       0   
48289  ...                        0                       0   
46216  ...                        1                       0   
234    ...                        0                       0   
24472  ...                        0                       0   
73449  ...                        0                       0   
76944  ...                        0                       0   
72359  ...                        0                       0   
40709  ...                        0                       0   
42960  ...                        0                       0   
80212  ...                        0                       0   
44195  ...                        0                       0   
30963  ...                        0                       1   
6279   ...                        0                       1   
61077  ...                        0                       1   
44119  ...                        0                       0   
22973  ...                        0                       1   
55237  ...                        0                       1   
7829   ...                        0                       1   
74118  ...                        0                       0   
11537  ...                        0                       1   
55407  ...                        0                       0   
...    ...                      ...                     ...   
75092  ...                        0                       0   
83271  ...                        0                       0   
80674  ...                        0                       0   
14337  ...                        0                       1   
40328  ...                        0                       0   
6497   ...                        0                       1   
33037  ...                        0                       0   
65621  ...                        0                       0   
65798  ...                        0                       0   
29235  ...                        0                       0   
13274  ...                        0                       1   
44693  ...                        0                       0   
38986  ...                        0                       0   
17621  ...                        0                       1   
12390  ...                        0                       1   
24038  ...                        0                       1   
73328  ...                        0                       0   
724    ...                        0                       0   
36921  ...                        0                       0   
19536  ...                        0                       0   
31975  ...                        0                       1   
80899  ...                        0                       0   
45516  ...                        0                       1   
15179  ...                        0                       1   
72025  ...                        0                       0   
26264  ...                        0                       1   
56193  ...                        0                       0   
65888  ...                        0                       0   
11837  ...                        0                       0   
73846  ...                        0                       0   

       TAX CLASS AT PRESENT_2A  TAX CLASS AT PRESENT_2B  \
26286                        0                        0   
55023                        0                        0   
56763                        0                        0   
47664                        0                        0   
47236                        1                        0   
83254                        0                        0   
82112                        0                        0   
10044                        0                        0   
44548                        0                        0   
48289                        0                        0   
46216                        0                        0   
234                          0                        1   
24472                        0                        0   
73449                        0                        0   
76944                        0                        0   
72359                        0                        0   
40709                        0                        0   
42960                        0                        0   
80212                        0                        0   
44195                        0                        0   
30963                        0                        0   
6279                         0                        0   
61077                        0                        0   
44119                        0                        0   
22973                        0                        0   
55237                        0                        0   
7829                         0                        0   
74118                        0                        0   
11537                        0                        0   
55407                        0                        0   
...                        ...                      ...   
75092                        0                        0   
83271                        0                        0   
80674                        0                        0   
14337                        0                        0   
40328                        0                        0   
6497                         0                        0   
33037                        0                        0   
65621                        0                        0   
65798                        0                        0   
29235                        0                        0   
13274                        0                        0   
44693                        0                        0   
38986                        1                        0   
17621                        0                        0   
12390                        0                        0   
24038                        0                        0   
73328                        0                        0   
724                          0                        0   
36921                        0                        0   
19536                        0                        0   
31975                        0                        0   
80899                        0                        0   
45516                        0                        0   
15179                        0                        0   
72025                        0                        0   
26264                        0                        0   
56193                        0                        0   
65888                        0                        0   
11837                        0                        0   
73846                        0                        0   

       TAX CLASS AT PRESENT_2C  TAX CLASS AT PRESENT_3  \
26286                        0                       0   
55023                        0                       0   
56763                        0                       0   
47664                        0                       0   
47236                        0                       0   
83254                        0                       0   
82112                        0                       0   
10044                        0                       0   
44548                        0                       0   
48289                        0                       0   
46216                        0                       0   
234                          0                       0   
24472                        0                       0   
73449                        0                       0   
76944                        0                       0   
72359                        0                       0   
40709                        0                       0   
42960                        0                       0   
80212                        0                       0   
44195                        0                       0   
30963                        0                       0   
6279                         0                       0   
61077                        0                       0   
44119                        0                       0   
22973                        0                       0   
55237                        0                       0   
7829                         0                       0   
74118                        0                       0   
11537                        0                       0   
55407                        0                       0   
...                        ...                     ...   
75092                        0                       0   
83271                        0                       0   
80674                        0                       0   
14337                        0                       0   
40328                        0                       0   
6497                         0                       0   
33037                        1                       0   
65621                        0                       0   
65798                        0                       0   
29235                        0                       0   
13274                        0                       0   
44693                        1                       0   
38986                        0                       0   
17621                        0                       0   
12390                        0                       0   
24038                        0                       0   
73328                        0                       0   
724                          0                       0   
36921                        0                       0   
19536                        0                       0   
31975                        0                       0   
80899                        0                       0   
45516                        0                       0   
15179                        0                       0   
72025                        0                       0   
26264                        0                       0   
56193                        0                       0   
65888                        0                       0   
11837                        0                       0   
73846                        0                       0   

       TAX CLASS AT PRESENT_4  TAX CLASS AT TIME OF SALE_2  \
26286                       0                            1   
55023                       0                            1   
56763                       0                            0   
47664                       0                            1   
47236                       0                            1   
83254                       0                            0   
82112                       0                            0   
10044                       0                            1   
44548                       0                            0   
48289                       0                            0   
46216                       0                            0   
234                         0                            1   
24472                       0                            0   
73449                       0                            0   
76944                       0                            0   
72359                       0                            0   
40709                       0                            0   
42960                       0                            0   
80212                       0                            0   
44195                       0                            0   
30963                       0                            1   
6279                        0                            1   
61077                       0                            1   
44119                       1                            0   
22973                       0                            1   
55237                       0                            1   
7829                        0                            1   
74118                       0                            0   
11537                       0                            1   
55407                       1                            0   
...                       ...                          ...   
75092                       0                            0   
83271                       0                            0   
80674                       0                            0   
14337                       0                            1   
40328                       0                            0   
6497                        0                            1   
33037                       0                            1   
65621                       0                            0   
65798                       0                            0   
29235                       0                            0   
13274                       0                            1   
44693                       0                            1   
38986                       0                            1   
17621                       0                            1   
12390                       0                            1   
24038                       0                            1   
73328                       0                            0   
724                         0                            1   
36921                       0                            0   
19536                       0                            0   
31975                       0                            1   
80899                       0                            0   
45516                       0                            1   
15179                       0                            1   
72025                       0                            0   
26264                       0                            1   
56193                       0                            0   
65888                       0                            0   
11837                       0                            1   
73846                       0                            0   

       TAX CLASS AT TIME OF SALE_3  TAX CLASS AT TIME OF SALE_4  
26286                            0                            0  
55023                            0                            0  
56763                            0                            0  
47664                            0                            0  
47236                            0                            0  
83254                            0                            0  
82112                            0                            0  
10044                            0                            0  
44548                            0                            0  
48289                            0                            0  
46216                            0                            0  
234                              0                            0  
24472                            0                            0  
73449                            0                            0  
76944                            0                            0  
72359                            0                            0  
40709                            0                            0  
42960                            0                            0  
80212                            0                            0  
44195                            0                            0  
30963                            0                            0  
6279                             0                            0  
61077                            0                            0  
44119                            0                            1  
22973                            0                            0  
55237                            0                            0  
7829                             0                            0  
74118                            0                            0  
11537                            0                            0  
55407                            0                            1  
...                            ...                          ...  
75092                            0                            0  
83271                            0                            0  
80674                            0                            0  
14337                            0                            0  
40328                            0                            0  
6497                             0                            0  
33037                            0                            0  
65621                            0                            0  
65798                            0                            0  
29235                            0                            0  
13274                            0                            0  
44693                            0                            0  
38986                            0                            0  
17621                            0                            0  
12390                            0                            0  
24038                            0                            0  
73328                            0                            0  
724                              0                            0  
36921                            0                            0  
19536                            0                            0  
31975                            0                            0  
80899                            0                            0  
45516                            0                            0  
15179                            0                            0  
72025                            0                            0  
26264                            0                            0  
56193                            0                            0  
65888                            0                            0  
11837                            0                            0  
73846                            0                            0  

[57948 rows x 63 columns]
26286     219000
55023    3250000
56763     715000
47664     578920
47236    1250000
83254     610000
82112     650000
10044     670000
44548    1325000
48289    1950000
46216     478000
234      4600000
24472     410000
73449     360500
76944     389900
72359     225000
40709    2400000
42960     300000
80212     287279
44195     445000
30963    3598989
6279      385000
61077     338412
44119       5000
22973     520000
55237     545000
7829      527000
74118     569900
11537    8500000
55407     150000
          ...   
75092     825000
83271     340000
80674     695000
14337     615000
40328     670000
6497      800000
33037     356250
65621     681159
65798     225000
29235     900000
13274     706000
44693     948800
38986     660000
17621     875000
12390     850000
24038     190000
73328     790000
724      1600000
36921     100000
19536     550000
31975     255000
80899     515000
45516     885877
15179     162500
72025     405000
26264     305000
56193     900000
65888     685000
11837    3015000
73846     603750
Name: SALE PRICE, Length: 57948, dtype: int64
fdf_normalized.head(6)
TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET BOROUGH_2 BOROUGH_3 BOROUGH_4 BOROUGH_5 BUILDING CLASS CATEGORY_02 TWO FAMILY DWELLINGS BUILDING CLASS CATEGORY_03 THREE FAMILY DWELLINGS BUILDING CLASS CATEGORY_04 TAX CLASS 1 CONDOS ... TAX CLASS AT PRESENT_1C TAX CLASS AT PRESENT_2 TAX CLASS AT PRESENT_2A TAX CLASS AT PRESENT_2B TAX CLASS AT PRESENT_2C TAX CLASS AT PRESENT_3 TAX CLASS AT PRESENT_4 TAX CLASS AT TIME OF SALE_2 TAX CLASS AT TIME OF SALE_3 TAX CLASS AT TIME OF SALE_4
26286 0 2500.0 2000.0 0 1 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
55023 16 4000.0 9740.0 0 0 1 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
56763 1 2760.0 1188.0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
47664 1 2500.0 2000.0 0 1 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0
47236 6 2587.0 4000.0 0 1 0 0 0 0 0 ... 0 0 1 0 0 0 0 1 0 0
83254 1 9800.0 1428.0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

6 rows × 63 columns

Split the data into train and test

X_train, X_test, y_train, y_test = train_test_split(fdf_normalized,y)
X_test.shape
(14487, 63)
X_train.shape
(43461, 63)

Train the model

# initialize the model
lr= LinearRegression()

# fit the model
model_fit=lr.fit(X_train,y_train)
print (model_fit.coef_)
print (model_fit.intercept_)
y= 3*Area + 7*LandSqfeet - 9xTax_Class+16
[-1.93526117e+03  1.09938376e+00  1.51695891e+00 -1.58952134e+06
 -1.11906190e+06 -1.35676903e+06 -1.57092132e+06  8.01383873e+04
  2.33315865e+05 -2.16414274e+06 -6.02943932e+04 -1.67683715e+05
  3.04148063e+06  5.68560182e+06 -2.04595243e+06 -1.93379681e+06
 -1.71505712e+06  1.65251503e+05 -1.63014307e+06 -1.06722527e+06
  3.45510299e+06 -1.33788139e+06 -7.25208862e+05 -2.18460893e+06
  7.02864152e+05  8.21351186e+05  3.53918153e+06 -1.88313563e+06
  1.63066930e+06 -1.10633429e+06  1.10187196e+05  8.64417632e+05
 -2.03417228e+05  3.02767117e+06  2.09679511e+06  1.81778072e+06
  8.96376694e+04 -7.71142596e+05  6.82083625e+05  7.63769534e+05
 -1.79297008e+06 -1.70220859e+05 -8.93666560e+05 -2.81805615e+05
 -1.83891930e+06 -2.62056702e+06  1.21997273e+06 -2.61822687e+06
 -1.72754459e+06 -1.99179836e+06 -2.24451025e+06 -2.65777004e+05
 -2.22173899e+06  2.94333424e+05 -5.99135136e+05 -4.41890764e+06
 -3.29384363e+06 -2.97796449e+05 -4.65661287e-10 -6.18578672e+05
 -2.92436926e+05  0.00000000e+00 -5.33367450e+05]
4274465.50771582

Test the model

#predict on test data
test_pred = model_fit.predict(X_test)
#Answers provided by us in the exam
#y_test => Answer Key
test_pred
array([2331681.15819392, 1763246.76111356, 1182925.97682065, ...,
        556292.18400827,  830775.68400524, 1187032.87294663])
test_null = y_test.mean()
#mean squared error
mse=mean_squared_error(y_test,test_null)

#root mean squared error
print('test rmse: {}'.format(np.sqrt(mse)))

#mean absolute error
#mae=mean_absolute_error(y_train,train_pred)
#print('train mae: {}'.format(mae))
test rmse: 1217441.6866231528
test_null = np.zeros_like(y_test, dtype=float)
test_null.fill(y_test.mean())
mse=mean_squared_error(y_test,test_null)
print('Null rmse: {}'.format(np.sqrt(mse)))
Null rmse: 1483445.904070305
len(test_pred)
14487

Model Explainability

LIME

X_test.values[4].shape
(63,)
# Import lime package
import lime
import lime.lime_tabular

#Find caegorical features
categorical_features = np.argwhere(np.array([len(set(X_test.values[:,x])) for x in range(X_test.values.shape[1])]) <= 10).flatten()
#Lime explainer for regression
explainer = lime.lime_tabular.LimeTabularExplainer(X_test.values,
feature_names=X_test.columns.values.tolist(),
class_names=['PriceCrore'],
categorical_features=categorical_features,
verbose=True, mode='regression')
ind = 4
#Fit on test data
exp = explainer.explain_instance(X_test.values[ind], model_fit.predict, num_features=6)
#Show in notebook features influencing predictions
exp.show_in_notebook(show_table=True)
Intercept -3825067.6591951344
Prediction_local [2025956.16261294]
Right: 1200453.3973587556

ELI5

# Import Eli5 package
import eli5
from eli5.sklearn import PermutationImportance

# Find the importance of columns for prediction
perm = PermutationImportance(model_fit, random_state=1).fit(X_test,test_pred)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())
Weight Feature
3.4368 ± 0.0219 TAXCLASSATPRESENT_1
1.6247 ± 0.0207 BUILDINGCLASSCATEGORY_10COOPS-ELEVATORAPARTMENTS             
1.1412 ± 0.0290 TAXCLASSATPRESENT_2A
1.0982 ± 0.0217 BOROUGH_4
0.6989 ± 0.0157 BUILDINGCLASSCATEGORY_07RENTALS-WALKUPAPARTMENTS             
0.6719 ± 0.0082 BOROUGH_3
0.6370 ± 0.0085 BOROUGH_5
0.5511 ± 0.0085 BOROUGH_2
0.5004 ± 0.0022 BUILDINGCLASSCATEGORY_09COOPS-WALKUPAPARTMENTS               
0.4493 ± 0.0051 BUILDINGCLASSCATEGORY_13CONDOS-ELEVATORAPARTMENTS            
0.2710 ± 0.0035 BUILDINGCLASSCATEGORY_04TAXCLASS1CONDOS                      
0.2566 ± 0.0034 BUILDINGCLASSCATEGORY_17CONDOCOOPS                             
0.2437 ± 0.0063 TAXCLASSATPRESENT_2
0.2019 ± 0.0053 BUILDINGCLASSCATEGORY_14RENTALS-4-10UNIT                     
0.1964 ± 0.0050 TAXCLASSATPRESENT_2B
0.1054 ± 0.0000 BUILDINGCLASSCATEGORY_08RENTALS-ELEVATORAPARTMENTS           
0.1045 ± 0.0015 TAXCLASSATPRESENT_1B
0.0915 ± 0.0011 BUILDINGCLASSCATEGORY_15CONDOS-2-10UNITRESIDENTIAL          
0.0827 ± 0.0019 BUILDINGCLASSCATEGORY_12CONDOS-WALKUPAPARTMENTS              
0.0595 ± 0.0007 TAXCLASSATTIMEOFSALE_2
… 43 more …
#Understanding how each feature influences the prediction
eli5.show_prediction(model_fit, doc=X_test.iloc[[ind]], feature_names=list(X_test.columns))
Wage = 1*Education + 300000

Wage = 300002
Wage = 300003
Wage = 300040

y (score 1200453.397) top features

Contribution? Feature
+4274465.508 <BIAS>
+3033.918 GROSSSQUAREFEET
+2748.459 LANDSQUAREFEET
-1935.261 TOTALUNITS
-292436.926 TAXCLASSATTIMEOFSALE_2
-599135.136 TAXCLASSATPRESENT_2
-1067225.268 BUILDINGCLASSCATEGORY_13CONDOS-ELEVATORAPARTMENTS            
-1119061.895 BOROUGH_3

SHAP

#Import SHAP package
import shap

#Create explainer for linear model
explainer = shap.LinearExplainer(model_fit,data=X_test.values)
shap_values = explainer.shap_values(X_test)
#Understanding how each feature influences the prediction

shap.initjs()
ind = 11


shap.force_plot(
    explainer.expected_value, shap_values[ind,:], X_test.iloc[ind,:],
    feature_names=X_test.columns.tolist()
)
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
shap.summary_plot(shap_values,X_test)
_images/NYC_Property_Sales_120_0.png